TITLE OF THE INVENTION

SPEECH RECOGNITION SYSTEM AND METHOD 
BACKGROUND OF THE INVENTION 
FIELD OF THE INVENTION 

The present invention relates to a speech recognition
system that allows electronic equipment to be controlled or
manipulated with uttered speech, and to a speech recognition
method for use in such a speech recognition system.

DESCRIPTION OF THE RELATED ART

Known speech recognition systems of this type are
applied to electronic equipment such as on-board audio
systems and on-board navigation systems.

In an on-board audio system equipped with a speech
recognition system, when a passenger says the name of a
desired radio broadcasting station, for example, the speech
recognition system recognizes the uttered speech and
automatically tunes to the reception frequency of the radio
broadcasting station based on the recognition result. This
improves the operability of the on-board audio system and
makes it easier for a passenger to use the on-board audio
system.

This speech recognition system also has other
capabilities that relieve a passenger of the burden of
operating an MD (Mini Disc) player and/or a CD (Compact Disc)
player. When the passenger loads an information-carrying
recording/reproducing medium, such as an MD disc, into the
MD player and says the title of a musical piece recorded on
that recording/reproducing medium, for example, the speech
recognition system recognizes the uttered speech and
automatically plays the selected musical piece.

An on-board navigation system equipped with a speech
recognition system is capable of recognizing a speech
uttered by a driver or the like to specify the name of the
destination, and of displaying a map showing the route from
the present location to the destination. This capability
allows the driver to concentrate on driving the vehicle,
thus ensuring a safer driving environment.

The above-described conventional speech recognition
systems are designed to cope with a single person who utters
words of instruction. The conventional speech recognition
systems therefore have only a single microphone for speech
input, provided at the location nearest to the driver, who
is the person most likely to use the microphone.

Other passengers who are seated far from the microphone
must therefore speak loudly toward the microphone to secure
a sufficient input voice level. To improve the speech
recognition precision of such a speech recognition system,
passengers other than the driver must also speak loudly
toward the microphone so that their speech is input to the
microphone without being affected by noise in the vehicle.

SUMMARY OF THE INVENTION 
Accordingly, it is an object of the present invention
to provide a speech recognition system which has improved
operability and allows more than one person to secure a
sufficient input voice level without speaking loudly and
without being affected by ambient noise.

It is another object of this invention to provide a 
speech recognition method for use in a speech recognition 
system, which improves the operability of the speech 
recognition system. 

To achieve the first object, according to one aspect of
this invention, there is provided a speech recognition
system which comprises a plurality of voice pickup sections
for picking up uttered voices; a determination section for
determining a speech signal suitable for speech recognition
from speech signals output from the plurality of voice
pickup sections; and a speech recognizer for performing
speech recognition based on the speech signal determined by
the determination section.

According to another aspect of this invention, there is
provided a speech recognition method for a speech
recognition system having a plurality of voice pickup means
for picking up voices, which comprises a determination step
of determining a speech signal suitable for speech
recognition from speech signals output from the plurality of
voice pickup means; and a speech recognition step of
performing speech recognition based on the speech signal
determined by the determination step.



In the speech recognition system or speech recognition
method, among the speech signals output from the plurality
of voice pickup sections (voice pickup means), a speech
signal whose speech level is equal to or higher than a
predetermined speech level continuously over a predetermined
period of time may be determined as the speech signal
suitable for speech recognition.

It is preferable that the determination section (or
step) acquires an average S/N value and an average voice
power of each of the speech signals output from the
plurality of voice pickup sections (or voice pickup means)
and determines a speech signal whose average S/N value and
average voice power are greater than respective
predetermined threshold values as the speech signal suitable
for speech recognition.

In this case, it is preferable that the determination
section determines a candidate order of those speech signals
whose average S/N values and average voice powers are
greater than the respective predetermined threshold values
and which are candidates for the speech signal suitable for
speech recognition, in accordance with the average S/N
values and average voice powers; and that the speech
recognizer sequentially executes speech recognition on the
candidates in accordance with the candidate order, from the
highest candidate to lower ones.

In any of the speech recognition system and method and
their preferable modes, the determination section (or step)
may treat the speech signals other than the speech signal
suitable for speech recognition as noise signals.

In any of the speech recognition system and method and
their preferable modes, among the speech signals other than
the speech signal suitable for speech recognition, the
speech signal whose average S/N value and average voice
power are minimum may be treated as a noise signal by the
determination section.

With the above structures, when a speaker makes a
desired speech, a speech signal and a noise signal suitable
for speech recognition are automatically determined from the
individual speech signals output from a plurality of voice
pickup sections (or voice pickup means), and speech
recognition is carried out based on the determined speech
signal and noise signal. Accordingly, the speaker has only
to utter words or voices, without consciously directing the
speech to a specific voice pickup section. This leads to
improved operability of the speech recognition system.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram illustrating the structure of
a speech recognition system according to one embodiment of
the present invention;

FIG. 2A is a plan view exemplifying the layout of
microphones in an ordinary 4-seat vehicle;

FIG. 2B is a plan view showing another layout of
microphones in an ordinary 4-seat vehicle;

FIG. 3A is a plan view exemplifying the layout of
microphones in a wagon or the like;

FIG. 3B is a plan view showing another layout of
microphones in a wagon or the like;

FIG. 4 is a block diagram showing the structures of a
multiplexer, a demultiplexer and a storage section;

FIG. 5 is a timing chart for explaining the timings of
sampling an input signal and storing sampled signals into a
storage section;

FIGS. 6A through 6D are explanatory diagrams for
explaining how to compute an average voice power, an average
noise power and an average S/N value;

FIG. 7 is an explanatory diagram showing the structure
of a speech condition table;

FIG. 8 is an explanatory diagram showing the structure
of a noise selection table;

FIG. 9 is a flowchart for explaining the operation of
the speech recognition system according to this embodiment;

FIG. 10 is a flowchart for further explaining the
operation of the speech recognition system according to this
embodiment; and

FIG. 11 is a block diagram illustrating the structure
of a modification of the speech recognition system according
to this embodiment.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

With reference to the accompanying drawings, a
description will now be given of a preferred embodiment of
the present invention as applied to a speech recognition
system which enables voice- or speech-based control or
manipulation of electronic equipment installed in a
vehicle, such as an on-board audio system or an on-board
navigation system.

FIG. 1 is a block diagram illustrating the structure of
a speech recognition system according to this embodiment of
this invention. Referring to this diagram, the speech
recognition system comprises a plurality of microphones M1 to
MN as voice pickup means, a plurality of pre-circuits CC1 to
CCN, a multiplexer 1, an A/D (Analog-to-Digital) converter
(ADC) 2, a demultiplexer 3, a storage section 4, a speech
detector 5, a data analyzer 6, a speech recognizer 7, a
controller 8 and a speech switch 9.

The pre-circuits CC1-CCN, the multiplexer 1, the A/D
converter 2, the demultiplexer 3, the storage section 4, the
speech detector 5, the data analyzer 6 and the controller 8
constitute determination means which determines a speech
signal and a noise signal suitable for speech recognition.

The single speech switch 9 is provided in the vicinity
of the driver seat, for example, on the front dashboard or
at one end of the front door by the driver seat.

The controller 8 has a microprocessor (MPU), which
controls the general operation of this speech recognition
system. When the speech switch 9 is switched on, sending an
ON signal SW to the microprocessor, the microprocessor
causes the microphones M1-MN to initiate a voice pickup
operation.

The speech detector 5 has number-of-speeches counters
FC1-FCN that are used to determine to which microphone an
uttered speech is directed; their details will be given in
a later description of the operation of the speech
recognition system.

The individual microphones M1-MN are provided at
locations where it is easy to pick up speeches uttered by
individual passengers, e.g., in the vicinity of the
individual passenger seats including the driver seat.

In one example where four microphones M1-M4 are placed
in a 4-seat vehicle, the microphones M1 and M2 are placed in
front of the driver seat and the front passenger seat, and
the microphones M3 and M4 are placed in front of the rear
passenger seats, e.g., at the corresponding roof portions or
at the back of the driver seat and the front passenger seat,
as shown in the plan view of FIG. 2A. This way, the
individual microphones M1-M4 are associated with the
respective passengers.

In another example, as shown in the plan view of FIG. 2B,
the microphones M1 and M2 may be placed in the front door by
the driver seat and the front door by the front passenger
seat, and the microphones M3 and M4 may be placed in the rear
doors by the respective rear passenger seats, so that the
individual microphones M1-M4 are associated with the
respective passengers.

In a further example, the microphones M1-M4 may be
provided at a combination of the locations shown in FIGS. 2A
and 2B. Specifically, the microphone M1 is placed in front
of the driver seat as shown in FIG. 2A or in the front door
by the driver seat as shown in FIG. 2B, so that a single
microphone is provided for the driver who sits on the driver
seat. Likewise, either the location shown in FIG. 2A or the
location shown in FIG. 2B is selected for each of the
remaining microphones M2-M4.

In the case of a wagon type vehicle or the like which
holds a greater number of seats, for example, a greater
number of microphones M1-M6 are provided in accordance with
the seats and at the locations where it is easy to pick up
speeches uttered by individual passengers, as shown in the
plan views of FIGS. 3A and 3B. Note that the microphones
M1-M6 may be provided at a combination of the locations shown
in FIGS. 3A and 3B, as in the aforementioned case of the
4-seat vehicle.

It is to be noted that the aforementioned microphone
layouts have been given simply as examples, and are to be
considered as illustrative and not restrictive. In practice,
system information that is used in the speech recognition
system of this invention is constructed beforehand in
consideration of the characteristics of voice transmission
from individual passengers to the respective microphones.
Strictly speaking, therefore, the conditions for setting the
microphones are not restricted at all. Further, the number
of microphones can be determined to be equal to or smaller
than the maximum number of passengers predetermined in
accordance with the type of vehicle.

The layout of the individual microphones is not limited
to a simple layout that makes the distances between the
microphones and the respective passengers equal to one
another. Those distances and the locations of the
individual microphones may be determined based on the
results of analysis of the voice characteristics in a
vehicle, previously acquired through experiments or the
like, in such a way that the characteristics of voice
transmission from the respective passengers to the
microphones become substantially the same.

Returning to FIG. 1, the microphones M1-MN are connected
to the respective pre-circuits CC1-CCN, thus constituting N
channels of signal processing systems.

Each of the pre-circuits CC1-CCN has an amplifier (not
shown) which amplifies the amplitude level of the associated
one of input speech signals S1 to SN, supplied from the
microphones M1-MN, to a level that is suitable for signal
processing, and a band-pass filter (not shown) which passes
only a predetermined frequency component of the amplified
input speech signal. The pre-circuits CC1-CCN supply input
speech signals S1' to SN', which have passed the respective
band-pass filters, to the multiplexer 1.

Each band-pass filter is set with a low cut-off
frequency fL (e.g., fL = 100 Hz) for eliminating
low-frequency noise included in the associated one of the
input speech signals S1-SN, and a high cut-off frequency fH
chosen in consideration of the Nyquist frequency. The low
cut-off frequency fL and the high cut-off frequency fH are
set so that the frequency range of voices that human beings
utter is included in the range between those two
frequencies.

As shown in FIG. 4, the multiplexer 1 comprises analog
switches AS1 to ASN for N channels. The input speech signals
S1'-SN' from the pre-circuits CC1-CCN are supplied to the
input terminals of the respective analog switches AS1-ASN,
whose output terminals are connected together to the A/D
converter 2. In accordance with channel switch signals CH1
to CHN supplied from the controller 8, the analog switches
AS1-ASN exclusively switch the input speech signals S1'-SN'
and supply the switched input speech signals S1'-SN' to the
A/D converter 2.

The A/D converter 2 converts the input speech signals
S1'-SN', sequentially supplied from the multiplexer 1, to
digital input data D1 to DN in synchronism with a
predetermined sampling frequency f, and supplies the digital
input data D1-DN to the demultiplexer 3.

The sampling frequency f is set by a sampling clock
CKADC from the controller 8 and is determined in
consideration of anti-aliasing. More specifically, the
sampling frequency f is determined to be equal to or higher
than approximately twice the high cut-off frequency fH of the
band-pass filter, and is set, for example, in a range of 8
kHz to 11 kHz.
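
The relation between the sampling frequency f and the high
cut-off frequency fH described above can be illustrated with
a minimal Python sketch. The function names and the value of
4 kHz used for fH are assumptions made for illustration only;
the description gives fL = 100 Hz but does not state a value
for fH.

```python
def min_sampling_frequency(f_high_hz: float) -> float:
    """Lowest sampling frequency that satisfies f >= 2 * fH."""
    return 2.0 * f_high_hz


def satisfies_anti_aliasing(f_sample_hz: float, f_high_hz: float) -> bool:
    """True when the sampling frequency is at least twice the high cut-off."""
    return f_sample_hz >= 2.0 * f_high_hz


# Hypothetical high cut-off frequency (not given in the description).
F_HIGH_HZ = 4000.0

# Both ends of the 8 kHz to 11 kHz range satisfy the condition for this fH.
RANGE_OK = all(satisfies_anti_aliasing(f, F_HIGH_HZ) for f in (8000.0, 11000.0))
```

With this assumed fH, any sampling frequency in the stated
range of 8 kHz to 11 kHz meets the condition, while a lower
frequency such as 7 kHz would not.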

The demultiplexer 3 comprises analog switches AW1 to AWN
for N channels, as shown in FIG. 4. The analog switches
AW1-AWN have their input terminals connected together to the
output terminal of the A/D converter 2 and their output
terminals respectively connected to memory areas ME1 to MEN
for N channels provided in the storage section 4. In
accordance with the channel switch signals CH1-CHN supplied
from the controller 8, the analog switches AW1-AWN
exclusively switch the input data D1-DN and supply the
switched input data D1-DN to the respective memory areas
ME1-MEN.

Referring now to the timing chart in FIG. 5, the
operations of the multiplexer 1, the A/D converter 2 and the
demultiplexer 3 will be explained. When the speech switch 9
is set on, the resultant ON signal SW is received by the
controller 8, which in turn outputs the sampling clock CKADC
and the channel switch signals CH1-CHN.

The sampling clock CKADC has a pulse waveform which
repeats the logical inversion N times during a period
(sampling period) T which is the reciprocal, 1/f, of the
sampling frequency f. The channel switch signals CH1-CHN
have pulse waveforms which sequentially become logic "1"
every period T/N of the sampling clock CKADC.

The multiplexer 1 exclusively performs switching
between enabling and disabling of the input speech signals
S1'-SN' in synchronism with the period T/N in which the
channel switch signals CH1-CHN sequentially become logic "1".
As a result, the input speech signals S1'-SN' are
sequentially supplied to the A/D converter 2 in synchronism
with the period T/N to be converted to the digital data
D1-DN. The demultiplexer 3 likewise exclusively performs
switching between enabling and disabling of the input data
D1-DN in synchronism with the period T/N in which the channel
switch signals CH1-CHN sequentially become logic "1".
Accordingly, the input data D1-DN from the A/D converter 2
are distributed and stored in the respective memory areas
ME1-MEN in synchronism with the period T/N.

Because the sampling of N channels of input speech
signals S1'-SN' within the sampling period T (= 1/f) is
repeated in this way, it is possible to generate N channels
of input data D1-DN with only the single A/D converter 2 in
synchronism with the sampling frequency f, and to store the
input data D1-DN into the predetermined memory areas ME1-MEN,
respectively.
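
The time-division operation of the multiplexer 1, the single
A/D converter 2 and the demultiplexer 3 can be sketched in
Python as a round-robin loop. This is only an illustrative
model: each channel is represented by a hypothetical function
of time, quantization by the A/D converter is not modelled,
and the name `sample_round_robin` is an assumption.

```python
from typing import Callable, List, Sequence


def sample_round_robin(
    channels: Sequence[Callable[[float], float]],
    f_sample_hz: float,
    num_periods: int,
) -> List[List[float]]:
    """Sample N channels once per sampling period T and route each sample
    to its own memory area, as the multiplexer/demultiplexer pair does."""
    n = len(channels)
    period = 1.0 / f_sample_hz          # sampling period T = 1/f
    slot = period / n                   # each channel is enabled for T/N
    memory_areas: List[List[float]] = [[] for _ in range(n)]  # ME1..MEN
    for k in range(num_periods):
        for ch, signal in enumerate(channels):
            t = k * period + ch * slot       # slot in which CH(ch+1) is logic "1"
            sample = signal(t)               # conversion by the shared A/D converter
            memory_areas[ch].append(sample)  # demultiplexer stores into ME(ch+1)
    return memory_areas
```

Each memory area receives exactly one sample per sampling
period T, so every channel is effectively sampled at the
frequency f even though only one converter is used.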

The storage section 4, which is constituted by a
semiconductor memory, has the aforementioned memory areas
ME1-MEN for N channels. That is, the memory areas ME1-MEN are
provided in association with the microphones M1-MN.

As shown in FIG. 4, each of the memory areas ME1-MEN has
a plurality of frame areas MF1, MF2 and so forth for storing
the associated one of the input data D1-DN frame by frame,
each frame consisting of a predetermined number of samples.

Referring to the memory area ME1, for example, the
frame areas MF1, MF2 and so forth sequentially store the
input data D1 supplied from the demultiplexer 3 by a
predetermined number of samples (256 samples in this
embodiment) in accordance with an address signal AD from
the controller 8. That is, every 256 samples of the input
data D1 are stored in each frame area MF1, MF2 or the like in
each frame period TF, which is 256 x T as shown in FIG. 5.
Input data for one frame period (1TF), which is stored in
each frame area MF1, MF2 or the like, is called "frame data".
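
The division of one channel's input data into frame data of
256 samples can be sketched as follows. The helper names are
hypothetical, and the handling of a trailing incomplete frame
(it is simply held back) is an assumption, since the
description does not discuss it.

```python
FRAME_SIZE = 256  # samples per frame in this embodiment


def split_into_frames(samples, frame_size=FRAME_SIZE):
    """Return the complete frames (MF1, MF2, ...) of one channel.

    A trailing partial frame is not returned (assumption)."""
    count = len(samples) // frame_size
    return [list(samples[i * frame_size:(i + 1) * frame_size])
            for i in range(count)]


def frame_period(f_sample_hz, frame_size=FRAME_SIZE):
    """Frame period TF = 256 x T, where T = 1/f."""
    return frame_size / f_sample_hz
```

At a sampling frequency of 8 kHz, for instance, one frame
period TF corresponds to 256/8000 s, i.e. 32 ms.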

Likewise, the input data D2-DN are stored, 256 samples
each, in the frame areas MF1, MF2 and so forth in the
remaining memory areas ME2-MEN in each frame period TF.

The speech detector 5 and the data analyzer 6 are
constituted by a DSP (Digital Signal Processor).

Every time frame data is stored in the frame areas MF1,
MF2 and so forth in each of the memory areas ME1-MEN, the
speech detector 5 computes the LPC (Linear Predictive
Coding) residual of the latest frame data and determines if
the computed value is equal to or greater than a
predetermined threshold value THD1. When the computed value
is equal to or greater than the predetermined threshold
value THD1, the speech detector 5 determines that the latest
frame data is speech frame data produced from a speech.
When the computed value is smaller than the predetermined
threshold value THD1, the speech detector 5 determines that
the latest frame data is input data that has not been
produced from a speech, i.e., noise frame data that has been
produced by noise in the vehicle.

When the computed LPC residual value remains equal to
or greater than the predetermined threshold value THD1 over
three frame periods (3TF), the speech detector 5 settles
that the frame data over the three frame periods (3TF) is
definitely speech frame data produced from a speech, and
transfers speech detection data DCT1 indicative of the
result of the decision to the controller 8.
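
The two-stage decision described above (a per-frame
comparison with THD1, then settlement after three consecutive
frame periods) can be sketched in Python. The LPC residual
computation itself is not shown; the sketch assumes the
per-frame residual values of one channel are already
available, and the function names are hypothetical.

```python
def classify_frames(residuals, thd1):
    """Per frame: True for speech frame data (value >= THD1),
    False for noise frame data."""
    return [r >= thd1 for r in residuals]


def first_settled_frame(residuals, thd1, run_length=3):
    """Index of the first frame of the first run of `run_length`
    consecutive speech frames, or None if speech is never settled."""
    run = 0
    for i, r in enumerate(residuals):
        run = run + 1 if r >= thd1 else 0  # reset the run on a noise frame
        if run == run_length:
            return i - run_length + 1
    return None
```

A single frame that exceeds THD1 between noise frames is
therefore classified as speech frame data but does not, on
its own, cause the speech to be settled.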

More specifically, the LPC residuals of frame data
stored in the individual frame areas MF1, MF2 and so forth in
each of the memory areas ME1-MEN are individually computed
channel by channel, and each channel-by-channel computed LPC
residual value is compared with the threshold value THD1 to
determine, channel by channel, if the frame data is speech
frame data produced from a speech.

Given that ε1 is the computed LPC residual value of the
first channel associated with the microphone M1, ε2 is the
computed LPC residual value of the second channel associated
with the microphone M2, and likewise ε3 to εN are the
computed LPC residual values of the third to N-th channels
respectively associated with the microphones M3-MN, the
computed values ε1-εN are compared with the threshold value
THD1. The frame data that corresponds to the channel whose
computed LPC residual value is equal to or greater than
the threshold value THD1 is determined as speech frame data
that has been generated from a speech. Further, the speech
frame data that corresponds to the channel whose computed
LPC residual value remains equal to or greater than the
threshold value THD1 over three frame periods (3TF) is
settled as speech frame data that is definitely generated
from a speech.

When a speech has been directed to the microphone M1
and the uttered voices have not been input to the remaining
microphones M2-MN, for example, only the frame data that is
stored in the memory area ME1 of the channel associated with
the microphone M1 is determined and settled as speech frame
data that has been produced from the speech, and the frame
data stored in the memory areas ME2-MEN associated with the
remaining microphones M2-MN are determined as noise frame
data generated from noise in the vehicle.

When a speech has been directed to the microphone M1
and the uttered voices have reached the microphone M2 but not
the remaining microphones M3-MN, for example, the frame data
stored in the memory areas ME1 and ME2 of the channels
associated with the microphones M1 and M2 are both determined
and settled as speech frame data produced from the speech,
and the frame data stored in the memory areas ME3-MEN
associated with the remaining microphones M3-MN are
determined as noise frame data.

In the above-described manner, the speech detector 5
computes the LPC residual of each of the frame data stored
in the memory areas ME1-MEN, compares it with the threshold
value THD1 to determine if uttered voices have been input to
any microphone and to determine the frame period in which
the uttered voices have been input, and transfers the speech
detection data DCT1 carrying information on those decisions
to the controller 8.



The speech detection data DCT1 is transferred to the 
controller 8 as predetermined code data which indicates the 
memory area where speech frame data has been stored over the 
aforementioned three frames or more (hereinafter this memory 
area will be called "speech memory channel") and its frame 
area (hereinafter called "speech memory frame"). 

Specifically, the speech detection data DCT1 has an
ordinary data structure of, for example,
DCT1{CH1(TF1, TF2, ..., TFm), CH2(TF1, TF2, ..., TFm), ...,
CHN(TF1, TF2, ..., TFm)}. CH1-CHN are flag data representing
the individual channels, and TF1, TF2, ..., TFm are flag
data corresponding to the individual frame areas MF1,
MF2, ..., MFm.

When an uttered speech is input only to the microphone
M1 and speech frame data is stored in the third and
subsequent frame areas MF3, MF4 and so forth, speech
detection data DCT1 of binary codes of
DCT1{1(0,0,1,1,...,1), 0(0,0,0,...,0), ..., 0(0,0,0,...,0)}
is transferred to the controller 8.
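
The flag structure of the speech detection data DCT1 can be
illustrated with a short Python sketch. The in-memory
representation (a list of channel-flag and frame-flag pairs)
and the function names are assumptions; the description only
specifies that DCT1 carries one flag per channel and one flag
per frame area.

```python
def build_dct1(speech_flags):
    """speech_flags[ch][frame] is True when that frame area holds settled
    speech frame data. Returns one (channel_flag, frame_flags) pair per
    channel, mirroring CHn(TF1, TF2, ..., TFm)."""
    dct1 = []
    for frames in speech_flags:
        frame_bits = [1 if flag else 0 for flag in frames]
        channel_bit = 1 if any(frame_bits) else 0  # channel holds speech at all
        dct1.append((channel_bit, frame_bits))
    return dct1


def format_dct1(dct1):
    """Render the flags in the notation used in the description."""
    parts = ["%d(%s)" % (ch, ",".join(str(b) for b in bits))
             for ch, bits in dct1]
    return "DCT1{" + ", ".join(parts) + "}"


# Speech input only to microphone M1, from the third frame area onward
# (three channels and five frame areas chosen for illustration).
EXAMPLE_FLAGS = [[False, False, True, True, True], [False] * 5, [False] * 5]
EXAMPLE_TEXT = format_dct1(build_dct1(EXAMPLE_FLAGS))
# EXAMPLE_TEXT == "DCT1{1(0,0,1,1,1), 0(0,0,0,0,0), 0(0,0,0,0,0)}"
```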

When the speech detection data DCT1 is transferred to
the controller 8, the controller 8 generates control data
CNT1 indicating the speech memory channel and speech memory
frame based on the speech detection data DCT1, and sends the
control data CNT1 to the data analyzer 6.

The data analyzer 6 comprises an optimal-speech
determining section 6a, a noise determining section 6b, an
average-S/N computing section 6c, an average-voice-power
computing section 6d, an average-noise-power computing
section 6e, a speech condition table 6f and a noise
selection table 6g. When receiving the control data CNT1
from the controller 8, the data analyzer 6 initiates a
process of determining speech frame data and noise frame
data suitable for speech recognition.

The average-voice-power computing section 6d acquires
information on the speech memory channel and speech memory
frame from the control data CNT1, reads speech frame data
from the memory area that corresponds to that speech memory
channel and speech memory frame, and computes the average
voice power P(n) of the speech frame data channel by
channel. The variable n in the average voice power P(n)
indicates a channel number.

When speech frame data is stored in the memory areas
ME1-ME4 corresponding to the channels CH1-CH4 as shown in
FIGS. 6A to 6D, for example, the average voice powers P(1)
to P(4) of plural pieces of speech frame data corresponding
to a plurality of predetermined frame periods (m2 x TF) from
a time ts at which a speech has started are computed channel
by channel. The average voice power P(n) is computed by
obtaining the sum of squares of speech frame data in the
frame periods (m2 x TF) and then dividing the sum by the
number of the frame periods (m2 x TF).

The average-noise-power computing section 6e acquires
information on the speech memory channel and speech memory
frame from the control data CNT1, reads noise frame data
preceding the speech frame data by a plurality of frame
periods (m1 x TF) from the memory area that corresponds to
that speech memory channel and speech memory frame, and
computes the average noise power NP(n) of the noise frame
data channel by channel. The variable n in the average noise
power NP(n) indicates a speech channel, and the average
noise power NP(n) is computed by obtaining the sum of
squares of noise frame data in the frame periods (m1 x TF)
and then dividing the sum by the number of the frame periods
(m1 x TF).

When speech frame data is stored in the memory areas
ME1-ME4 corresponding to the channels CH1-CH4 as shown in
FIGS. 6A to 6D, for example, the average noise power NP(n)
of plural pieces of noise frame data in the plurality of
frame periods (m1 x TF) preceding the time ts at which a
speech has started (at which storage of the speech frame
data has started) is computed.
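
The computation of the average voice power P(n) and the
average noise power NP(n) for one channel can be sketched as
follows. The sketch takes the wording literally and divides
the sum of squares by the number of frames (m2 or m1);
dividing by the number of samples instead would only change
the result by the constant factor 256. The function names and
the argument `speech_start` (the index of the frame at the
time ts) are assumptions.

```python
def average_power(frames):
    """Sum of squares of all samples in `frames`, divided by the
    number of frames."""
    if not frames:
        raise ValueError("at least one frame is required")
    total = sum(x * x for frame in frames for x in frame)
    return total / len(frames)


def voice_and_noise_power(channel_frames, speech_start, m1, m2):
    """Return (P, NP) for one channel.

    P  uses the m2 frames starting at the speech start time ts.
    NP uses the m1 frames immediately preceding ts."""
    noise = channel_frames[max(0, speech_start - m1):speech_start]
    voice = channel_frames[speech_start:speech_start + m2]
    return average_power(voice), average_power(noise)
```

Note that this sketch raises an error when no noise frame
precedes the speech, a case the description does not address.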

The average-S/N computing section 6c computes an
average S/N value SN(n), which represents the value of the
signal-to-noise ratio for each speech channel, based on the
average voice power P(n) computed by the average-voice-power
computing section 6d and the average noise power NP(n)
computed by the average-noise-power computing section 6e.

In the case where the channels CH1-CH4 are speech
channels as shown in FIGS. 6A to 6D, for example, the
average S/N values SN(1) to SN(4) of the individual channels
CH1-CH4 are computed from the following equations 1 to 4.

SN(1) = P(1)/NP(1) ... (1)

SN(2) = P(2)/NP(2) ... (2)

SN(3) = P(3)/NP(3) ... (3)

SN(4) = P(4)/NP(4) ... (4)

Logarithmic values of the average S/N values SN(1) to
SN(4) computed from the equations 1 to 4 may be taken as the
average S/N values SN(1)-SN(4) of the individual channels
CH1-CH4.
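
Equations 1 to 4 and the optional logarithmic form can be
expressed as a single Python function. The decibel scaling
10 x log10 used for the logarithmic case is an assumption;
the description only says that logarithmic values may be
taken.

```python
import math


def average_sn(voice_power, noise_power, logarithmic=False):
    """SN(n) = P(n) / NP(n), optionally expressed in decibels."""
    if noise_power <= 0:
        raise ValueError("noise power must be positive")
    ratio = voice_power / noise_power
    if not logarithmic:
        return ratio
    if ratio <= 0:
        raise ValueError("voice power must be positive for the log form")
    return 10.0 * math.log10(ratio)  # assumed dB convention
```

For example, P(n) = 18 and NP(n) = 2 give SN(n) = 9 in the
linear form.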

The optimal-speech determining section 6a compares the
average S/N value SN(n) acquired by the average-S/N
computing section 6c with a predetermined threshold value
THD2, and compares the average voice power P(n) acquired by
the average-voice-power computing section 6d with a
predetermined threshold value THD3. The optimal-speech
determining section 6a then collates the results of the
comparison with the speech condition table 6f shown in FIG.
7 to determine which channel of speech frame data is
suitable for the speech recognition process.

As shown in FIG. 7, the speech condition table 6f
stores reference data for ranking speech frame data in
accordance with the relationship between the average S/N
value and the threshold value THD2 and the relationship
between the average voice power and the threshold value
THD3. Referring to the speech condition table 6f based on
the comparison results, the optimal-speech determining
section 6a ranks the speech frame data and determines the
speech frame data of the highest rank as the one suitable
for speech recognition.



Specifically, the optimal-speech determining section 6a
determines the speech frame data whose average S/N value is
equal to or greater than the threshold value THD2 and whose
average voice power is equal to or greater than the
threshold value THD3 as a rank 1 (Rnk1), determines the
speech frame data whose average S/N value is equal to or
greater than the threshold value THD2 and whose average
voice power is less than the threshold value THD3 as a rank
2 (Rnk2), determines the speech frame data whose average S/N
value is smaller than the threshold value THD2 and whose
average voice power is equal to or greater than the
threshold value THD3 as a rank 3 (Rnk3), and determines the
speech frame data whose average S/N value is smaller than
the threshold value THD2 and whose average voice power is
less than the threshold value THD3 as a rank 4 (Rnk4).

Further, the optimal-speech determining section 6a
determines, among all the channels of speech frame data,
the speech frame data whose average S/N value and average
voice power are maximum as a rank 0 (Rnk0).

Then, the optimal-speech determining section 6a
determines the speech frame data of the rank 0 (Rnk0) as
the candidate most suitable for speech recognition (first
candidate). Further, the optimal-speech determining section
6a determines the speech frame data of the rank 1 (Rnk1) as
the next candidate suitable for speech recognition (second
candidate). When there are a plurality of channels whose
speech frame data are of the rank 1 (Rnk1), those speech
frame data which have greater average S/N values and
greater average voice powers are determined as candidates
of higher ranks.

Further, the optimal-speech determining section 6a
removes the speech frame data that correspond to the rank 2
(Rnk2) to the rank 4 (Rnk4) from the targets for speech
recognition, considering that they are unsuitable for speech
recognition.
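
The ranking and candidate ordering described above can be
sketched in Python as follows. Several details are
assumptions made for illustration: the rank 0 is assigned
only when a single channel has both the maximum average S/N
value and the maximum average voice power, and rank 1
channels are ordered by S/N value first and voice power
second. FIG. 7 is not reproduced here, so the table is
expressed directly as comparisons.

```python
def rank_channels(sn, power, thd2, thd3):
    """Return one rank (0 to 4) per channel from SN(n), P(n), THD2, THD3."""
    ranks = []
    for s, p in zip(sn, power):
        if s >= thd2 and p >= thd3:
            ranks.append(1)
        elif s >= thd2:
            ranks.append(2)
        elif p >= thd3:
            ranks.append(3)
        else:
            ranks.append(4)
    # Rank 0: the channel whose S/N value and voice power are both maximum.
    best_sn = max(range(len(sn)), key=lambda i: sn[i])
    best_power = max(range(len(power)), key=lambda i: power[i])
    if best_sn == best_power:
        ranks[best_sn] = 0
    return ranks


def candidate_order(sn, power, ranks):
    """Channel indices kept for recognition (ranks 0 and 1), best first."""
    kept = [i for i, r in enumerate(ranks) if r in (0, 1)]
    return sorted(kept, key=lambda i: (ranks[i], -sn[i], -power[i]))
```

The speech recognizer would then be given the candidates in
the returned order, starting with the rank 0 channel when one
exists.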

In short, the optimal-speech determining section 6a 
compares the average S/N value SN(n) and the average voice 
power P(n) with the threshold values THD2 and THD3 
respectively, collates the comparison results with the 
speech condition table 6f shown in FIG. 7 to determine the 
speech frame data that is suitable for speech recognition, 
and then assigns a priority order (ranking) to the speech 
frame data suitable for speech recognition. Then, the 
optimal-speech determining section 6a transfers speech 
candidate data DCT2 indicating the ranking to the controller 8. 
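The priority ordering summarized above might be sketched as below. The function name `order_candidates` is hypothetical, channels are 0-based indices here, and the tie-breaking rule (average S/N first, then average voice power) is an assumption; the description only states that rank-1 channels with greater average S/N values and greater average voice powers become higher candidates.

```python
from typing import List


def order_candidates(ranks: List[int], sn: List[float],
                     power: List[float]) -> List[int]:
    """Return channel indices in candidate order; ranks 2 to 4 are dropped."""
    # First candidate: the rank-0 channel, if any.
    first = [n for n, r in enumerate(ranks) if r == 0]
    # Next candidates: rank-1 channels, larger S/N and power first
    # (assumed ordering key: S/N, then power).
    rank1 = [n for n, r in enumerate(ranks) if r == 1]
    rank1.sort(key=lambda n: (sn[n], power[n]), reverse=True)
    return first + rank1
```

With ranks 0, 1, 2, 1 and average S/N values 12.0, 9.0, 8.5, 10.0, the order is channel index 0, then 3, then 1; the rank-2 channel is not a candidate.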

The noise determining section 6b collates combinations 
of all the ranks for N channels that are acquired by the 
optimal-speech determining section 6a with the noise 
selection table 6g shown in FIG. 8, and determines any 
channel for which the ranking combination has a match as a 
noise channel. 

When the ranks of the individual channels starting at 
the first channel CH1 are (Rnk0), (Rnk1), (Rnk2), (Rnk1), ..., 
for example, the noise determining section 6b determines the 
third channel CH3 as a noise channel. Then, the noise 
determining section 6b sends noise candidate data DCT3 to 
the controller 8. 
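The table lookup performed by the noise determining section 6b can be illustrated with a small dictionary. This is only a sketch for a four-channel arrangement: the name `find_noise_channel` is hypothetical, and the entries are limited to cases 1 to 4 of FIG. 8; a real table would hold every preset case.

```python
from typing import Dict, Optional, Tuple

# Rank combination (CH1, CH2, CH3, CH4) -> noise channel number,
# transcribed from cases 1 to 4 of the noise selection table in FIG. 8.
NOISE_SELECTION_TABLE: Dict[Tuple[int, int, int, int], int] = {
    (0, 1, 2, 1): 3,  # case 1
    (0, 1, 1, 2): 4,  # case 2
    (0, 1, 1, 3): 4,  # case 3
    (0, 1, 1, 4): 4,  # case 4
}


def find_noise_channel(ranks: Tuple[int, int, int, int]) -> Optional[int]:
    """Return the noise channel for a rank combination, or None if no case matches."""
    return NOISE_SELECTION_TABLE.get(ranks)
```

The example in the text corresponds to `find_noise_channel((0, 1, 2, 1))`, which yields the third channel.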

When the optimal-speech determining section 6a 
determines a candidate of speech frame data suitable for 
speech recognition, the noise determining section 6b 
determines a noise channel corresponding to the candidate of 
speech frame data suitable for speech recognition by 
referring to the individual "cases" in FIG. 8. Accordingly, 
a candidate of speech frame data suitable for speech 
recognition and noise data obtained by the microphone that 
has picked up noise are determined in association with each 
other. 

The individual cases 1, 2, 3 and so forth in the noise 
selection table 6g in FIG. 8 are preset based on the results 
of experiments on the voice characteristics obtained when 
passengers actually uttered voices at various positions in a 
vehicle in which all the microphones M1-MN were actually 
installed. 

When the speech candidate data DCT2 and the noise 
candidate data DCT3 are supplied to the controller 8, the 
controller 8 accesses that one of the memory areas ME1-MEN 
which corresponds to the channel of the first candidate based 
on the speech candidate data DCT2, reads the speech frame data 
most suitable for speech recognition and supplies it to the 
speech recognizer 7. 

The speech recognizer 7 performs known processes, such 
as SS (Spectrum Subtraction), echo canceling, noise 
canceling and CMN (cepstral mean normalization), based on 
the speech frame data and noise frame data supplied from 
the storage section 4 to thereby eliminate a noise component 
from the speech frame data, performs speech recognition 
based on the noise-component-removed speech frame data and 
outputs data Dout representing the result of speech 
recognition. 
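As an illustration of one of the known processes named above, the following is a minimal sketch of magnitude spectral subtraction for a single frame. It is not taken from this specification: the function name `spectral_subtract`, the per-bin magnitude representation and the spectral floor factor are all assumptions chosen for the example.

```python
from typing import List


def spectral_subtract(speech_mag: List[float], noise_mag: List[float],
                      floor: float = 0.01) -> List[float]:
    """Subtract a noise magnitude spectrum from a speech magnitude spectrum.

    Each bin is clamped to a small fraction of the original speech magnitude
    (assumed spectral floor) so that the result never becomes negative.
    """
    if len(speech_mag) != len(noise_mag):
        raise ValueError("spectra must have the same number of bins")
    return [max(s - n, floor * s) for s, n in zip(speech_mag, noise_mag)]
```

In this sketch the noise spectrum would come from the noise frame data of the noise channel, and the speech spectrum from the speech frame data of the selected candidate.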

If an adequate speech recognition result is not 
acquired from the speech recognition performed by the speech 
recognizer 7 based on the speech frame data and the noise 
frame data suitable for speech recognition, the controller 8 
accesses the memory area that corresponds to the channel of 
the next candidate suitable for speech recognition and 
transfers the corresponding speech frame data to the speech 
recognizer 7. Thereafter, the controller 8 supplies speech 
frame data of the channels of subsequent candidates in order 
to the speech recognizer 7 until an adequate speech 
recognition result is acquired. 
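The candidate-by-candidate retry performed by the controller 8 can be sketched as a simple loop. The names `recognize_with_fallback`, `read_frames` and `recognize` are hypothetical stand-ins for the memory access and the speech recognizer 7, and returning `None` for an inadequate result is an assumption made for the example.

```python
from typing import Callable, Iterable, Optional


def recognize_with_fallback(candidates: Iterable[int],
                            read_frames: Callable[[int], object],
                            recognize: Callable[[object], Optional[str]]
                            ) -> Optional[str]:
    """Try each candidate channel in priority order until recognition succeeds."""
    for channel in candidates:
        frames = read_frames(channel)   # speech frame data of this candidate
        result = recognize(frames)      # None means "no adequate result"
        if result is not None:
            return result
    return None                         # every candidate was exhausted
```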

An example of the operation of this speech recognition 
system which has the above-described structure will be 
discussed with reference to the flowcharts shown in FIGS. 9 
and 10. FIG. 9 illustrates an operational sequence from the 
pickup of sounds with the microphones M1-MN to the storage of 
the input data D1-DN into the storage section 4 as frame data, 
and FIG. 10 illustrates the operation at the time the data 
analyzer 6 determines optimal speech frame data and noise 
frame data. 



In FIG. 9, the speech recognition system stands by 
until the speech switch 9 is switched on in step 100. Upon 
occurrence of the ON event of the speech switch 9, the flow 
goes to step 102 to perform initialization. This 
initializing process clears a count value n of a 
channel-number counter, a count value m of a frame-number 
counter and all values F(1) to F(N) of the number-of-speeches 
counters FC1-FCN, all provided in the controller 8. 

The channel-number counter is provided to designate 
each of the channels of the microphones M1-MN with the count 
value n. The frame-number counter is provided to designate 
the number (address) of each of the frame areas MF1, MF2, MF3 
and so forth, provided in each of the memory areas ME1-MEN, 
with the count value m. 

N number-of-speeches counters FC1-FCN are provided in 
association with the individual channels. That is, the 
first number-of-speeches counter FC1 is provided in 
association with the first channel, the second 
number-of-speeches counter FC2 is provided in association 
with the second channel, and so forth to the N-th 
number-of-speeches counter FCN provided in association with 
the N-th channel. The number-of-speeches counters FC1-FCN 
are used to determine whether or not an LPC residual εn 
greater than the threshold value THD1 has consecutively 
continued over three or more frames and to determine the 
channel for which the LPC residual εn has continued over 
three or more frames. The number-of-speeches counters 
FC1-FCN are also used to determine, as a speech-input 
channel, the channel for which the LPC residual εn has 
continued over three or more frames. 

In the next step 104, the first frame area MF1 of each 
of the memory areas ME1-MEN is set. That is, the number m 
of the frame area is set to m = 1. 

In subsequent steps 106 and 108, the microphones M1-MN 
start picking up sounds and the input data D1-DN acquired by 
the voice pickup are stored in the individual first frame 
areas MF1 of the memory areas ME1-MEN frame by frame. 

When one frame of input data D1-DN is stored, the memory 
area ME1 that corresponds to the first (n = 1) channel is 
designated in step 110, and the LPC residual εn (n = 1) of 
frame data stored in the first (m = 1) frame area MF1 of the 
memory area ME1 is computed in step 112. 

In the next step 114, the LPC residual εn is compared 
with the threshold value THD1. When εn ≥ THD1, the flow 
goes to step 116 to increment the value F(1) of the 
number-of-speeches counter FC1 corresponding to the first 
channel by "1". When εn < THD1, the flow goes to step 118 
to clear the value F(1) of the number-of-speeches counter 
FC1. 

When εn becomes equal to or greater than THD1 (εn ≥ 
THD1), therefore, the value F(1) of the number-of-speeches 
counter FC1 becomes "1", which indicates that one frame of 
speech has been input to the microphone M1 of the first 
channel. 

When εn becomes smaller than THD1 (εn < THD1), on the 
other hand, the value F(1) of the number-of-speeches counter 
FC1 is cleared to "0", which indicates that no speech has 
been input to the microphone M1 of the first channel. 
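The increment-or-clear rule of steps 114 to 118 amounts to counting consecutive frames whose LPC residual reaches the threshold. A minimal sketch, with the hypothetical function name `update_counter`:

```python
def update_counter(count: int, residual: float, thd1: float) -> int:
    """Return the new value F(n) of a number-of-speeches counter for one frame."""
    if residual >= thd1:
        return count + 1   # step 116: one more consecutive speech frame
    return 0               # step 118: the run of speech frames is broken
```

Because the counter is cleared whenever the residual falls below THD1, its value is always the length of the current run of consecutive speech frames on that channel.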

Next, it is checked if n is equal to N (n = N) in step 
120 to determine whether the LPC residual εn in every 
channel has been computed. When n = N is not met, the flow 
goes to step 122 to make n = n + 1 to set the next channel, 
and the sequence of processes from step 112 is repeated. 
That is, by repeating the processes of steps 112 to 122, the 
LPC residual εn of frame data stored in the frame area MF1 
of each of the memory areas ME1-MEN is compared with the 
threshold value THD1. When the LPC residual εn becomes 
equal to or greater than the threshold value THD1, the value 
F(n) of the number-of-speeches counter FCn corresponding to 
that channel number n is incremented by "1". 

When n = N is met in the aforementioned step 120, it is 
determined that the processing for all the channels has been 
completed, and the flow proceeds to step 124. 

In step 124, it is determined if any one of the values 
F(1) to F(N) of the number-of-speeches counters FC1-FCN has 
become equal to or greater than "3". If there is no such 
count value, i.e., if all of the values F(1) to F(N) are 
equal to or smaller than "2", the flow goes to step 126. 

In step 126, the individual second frame areas MF2 of 
the memory areas ME1-MEN are set by setting m = m + 1. 
Then, the processes of steps 106 to 124 are repeated. 

Accordingly, the input data is stored in each frame 
area MF2 (steps 106 and 108), the LPC residual εn of each 
frame data stored in each frame area MF2 is compared with the 
threshold value THD1 (steps 110 to 114), and each of the 
values F(1) to F(N) of the number-of-speeches counters 
FC1-FCN is incremented or cleared based on the comparison 
results. 

In step 124, it is determined again if any one of the 
values F(1) to F(N) of the number-of-speeches counters 
FC1-FCN has become equal to or greater than "3". If there is 
no such count value, the flow goes to step 126 to set m = m + 
1 so that the next frame areas MF3 of the memory areas 
ME1-MEN are set. Then, the processes of steps 106 to 124 are 
repeated. 

As the processes of steps 106 to 124 are repeated and 
at least one of the values F(1) to F(N) of the 
number-of-speeches counters FC1-FCN becomes equal to or 
greater than "3", the flow proceeds to step 128. 

In other words, in step 124, the values F(1) to F(N) of 
the number-of-speeches counters FC1-FCN are checked and only 
when the LPC residual εn greater than the threshold value 
THD1 consecutively continues over three or more frames, 
frame data stored in the memory area corresponding to that 
channel is determined and settled as speech frame data. 

In the next step 128, it is determined if the value of 
the number-of-speeches counter for which it was determined 
that the LPC residual εn greater than the threshold value 
THD1 consecutively continued over three or more frames has 
reached "5". If that value has not reached "5" yet, the 
process in step 126 is carried out, after which the processes 
of steps 106 to 128 are repeated. 

There may be a case where, when the value F(n) of the 
number-of-speeches counter that corresponds to a given 
channel n becomes "3", the values of the number-of-speeches 
counters corresponding to the remaining channels are "1" or 
"2". In this case, frame data stored in the memory areas 
corresponding to the remaining channels are likely to be 
also speech frame data. 

To cope with this case, therefore, the processes of 
steps 106 to 128 are repeated twice to check if the frame 
data stored in the memory areas corresponding to the 
remaining channels are speech frame data. 
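The detection loop of steps 106 to 128 can be sketched as below. The function name `detect_speech_channels` and the list-of-lists input are hypothetical, channels are 0-based indices, and one point is an interpretation rather than a statement of the text: when some counter reaches "5", every channel whose counter is "3" or more at that moment is treated as a speech channel.

```python
from typing import List


def detect_speech_channels(residuals_per_frame: List[List[float]],
                           thd1: float,
                           settle: int = 3, stop: int = 5) -> List[int]:
    """Return the channels judged to carry speech, or [] if none is settled.

    residuals_per_frame[m][n] is the LPC residual of frame m on channel n.
    """
    if not residuals_per_frame:
        return []
    counts = [0] * len(residuals_per_frame[0])
    for frame in residuals_per_frame:
        # Steps 112 to 122: update every channel's consecutive-frame counter.
        counts = [c + 1 if r >= thd1 else 0 for c, r in zip(counts, frame)]
        # Steps 124 and 128: stop once a settled counter has reached `stop`.
        if max(counts) >= stop:
            return [n for n, c in enumerate(counts) if c >= settle]
    return []
```

Running two extra frames after the first counter reaches "3" gives a channel that started at "1" or "2" the chance to reach "3" as well, which is the case the preceding paragraphs describe.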

When the decision in step 128 is "YES", the flow goes 
to step 130 where the speech detection data DCT1, which has 
information on the memory area where speech frame data is 
stored and the memory area where noise frame data is stored, 
is transferred to the controller 8. The flow then proceeds 
to a routine illustrated in FIG. 10. 

When the operation goes to the routine illustrated in 
FIG. 10, the average voice power P(n), the average noise 
power NP(n) and the average S/N value SN(n) for each channel 
are computed first in step 200. Next, a candidate of speech 
frame data suitable for speech recognition is determined 
based on the speech condition table 6f shown in FIG. 7 in 
step 202. In the next step 204, noise frame data suitable 
for speech recognition is determined based on the noise 
selection table 6g shown in FIG. 8. 
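The quantities computed in step 200 might be obtained as in the sketch below. The formulas are assumptions made for illustration, since this passage does not restate the definitions: average power is taken as the mean squared sample value over the given frames, and the average S/N value as the ratio of average voice power to average noise power. The function names are hypothetical.

```python
from typing import List


def average_power(frames: List[List[float]]) -> float:
    """Mean squared sample value over all frames (assumed definition)."""
    samples = [x for frame in frames for x in frame]
    if not samples:
        raise ValueError("no samples")
    return sum(x * x for x in samples) / len(samples)


def average_sn(voice_power: float, noise_power: float) -> float:
    """Ratio of average voice power P(n) to average noise power NP(n) (assumed)."""
    if noise_power <= 0:
        raise ValueError("noise power must be positive")
    return voice_power / noise_power
```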

In step 206, the speech candidate data DCT2 that 
indicates the candidate of speech frame data suitable for 
speech recognition and the noise candidate data DCT3 that 
indicates the noise frame data are sent to the controller 8 
from the data analyzer 6. In other words, the speech 
candidate data DCT2 and the noise candidate data DCT3 inform 
the controller 8 of the candidate of speech frame data 
suitable for speech recognition and the noise frame data 
suitable for speech recognition associated with that 
candidate. 

In the next step 208, the speech recognizer 7 reads the 
speech frame data and noise frame data most suitable for 
speech recognition from the storage section 4, performs 
speech recognition on the read speech frame data and noise 
frame data, and terminates the sequence of speech recognition 
processes when an adequate speech recognition result is 
acquired (step 210). 

When no adequate speech recognition result is acquired, 
on the other hand, the speech recognizer 7 checks in step 
212 if there are next candidates of speech frame data and 
noise frame data, reads the next candidates of speech frame 
data and noise frame data, if present, from the storage 
section 4 and repeats the sequence of processes starting at 
step 208. When no adequate speech recognition result is 
obtained even after re-execution of the speech recognition, 
the speech recognizer 7 likewise reads next candidates of 
speech frame data and noise frame data from the storage 
section 4 and repeats the sequence of processes in steps 208 
to 212 until an adequate speech recognition result is 
obtained. 

According to this embodiment, as apparent from the 
above, a plurality of microphones M1-MN for inputting voices 
are placed in a vehicle and speech frame data and noise 
frame data suitable for speech recognition are automatically 
extracted from those speech frame data and noise frame data 
that are picked up by the microphones M1-MN and are subjected 
to speech recognition. This speech recognition system can 
therefore provide a plurality of speakers (passengers) with 
a better operability than the conventional speech 
recognition system that is designed for a single speaker. 

When one of a plurality of passengers directs a desired 
speech to a certain microphone (e.g., M1), the uttered speech 
may generally be picked up by the other microphones (M2-MN) 
so that it is difficult to determine which microphone has 
actually been intended to pick up the uttered speech. 
According to this embodiment, however, speech frame data and 
noise frame data suitable for speech recognition are 
automatically extracted by using the speech condition table 
6f and the noise selection table 6g, respectively shown in 
FIGS. 7 and 8, and speech recognition is carried out based 
on the extracted speech frame data and noise frame data. 
This makes it possible to associate the passenger who has 
made the speech with the microphone (e.g., M1) close to that 
passenger with a very high probability. 

Accordingly, this speech recognition system 
automatically specifies a passenger who tries to perform a 
voice-based manipulation of electronic equipment 
installed in a vehicle and allows the optimal microphone 
(close to the passenger) to pick up the uttered speech. 
This can improve the speech recognition precision. With the 
use of this speech recognition system, a passenger requires 
no special manipulation but merely needs to utter words to 
give his or her voiced instruction through the appropriate 
microphone, so that this speech recognition system is 
considerably easy to use. 

Suppose that while one or more passengers who do not 
intend to perform a voice-based manipulation of on-board 
electronic equipment are making a conversation or the like, 
one person utters words to perform such a voice-based 
manipulation. Even in this case, the conversation or the 
like made by the passengers who are not performing the 
voice-based manipulation is determined as noise and 
eliminated from consideration by automatically extracting 
speech frame data and noise frame data suitable for speech 
recognition by using the speech condition table 6f and the 
noise selection table 6g, respectively shown in FIGS. 7 and 
8, and then carrying out speech recognition based on the 
extracted speech frame data and noise frame data. This can 
provide a speech recognition system which is not affected by 
a conversation or the like taking place in the vehicle 
and which is very easy to use. 

Although this embodiment is provided with the single 
speech switch 9, shown in FIG. 1, which is switched on by, 
for example, a driver, this invention is not limited to this 
particular structure. For example, a plurality of 
microphones M1-MN may be respectively provided with speech 
switches TK1 to TKN as shown in the block diagram in FIG. 11, 
so that when one of the speech switches is set on, the 
controller 8 allows the microphone that corresponds to the 
activated speech switch to pick up words and determines that 
the remaining microphones corresponding to the inactive 
speech switches have picked up noise in the vehicle. 

This modified structure can specify the microphone that 
has picked up an uttered speech and the microphones that 
have picked up noise before speech recognition. This can 
shorten the processing time and makes it easy to determine 
the speech data and noise data most suitable for speech 
recognition. 
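With one speech switch per microphone, the speech and noise channels follow directly from the switch states. A minimal sketch, in which the function name `split_by_switch` and the 1-based channel numbering are assumptions for the example:

```python
from typing import List, Tuple


def split_by_switch(switch_states: List[bool]) -> Tuple[List[int], List[int]]:
    """Split channels 1..N into (speech channels, noise channels).

    switch_states[k] is True when the speech switch of channel k + 1 is on.
    """
    speech = [n for n, on in enumerate(switch_states, start=1) if on]
    noise = [n for n, on in enumerate(switch_states, start=1) if not on]
    return speech, noise
```

If only the second of four switches is on, channel 2 is the speech channel and channels 1, 3 and 4 are treated as noise channels.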

Further, the structure shown in FIG. 1 and the 
structure shown in FIG. 11 may be combined as needed. 
Specifically, speech switches smaller in number than the 
microphones M1-MN may be placed at adequate locations in a 
vehicle so that when one of the speech switches is set on, 
the controller 8 detects the event and initiates speech 
recognition. In this case, the speech switches do not 
completely correspond one-to-one to the microphones M1-MN, so 
that while speech recognition is carried out with the 
structure shown in FIG. 1, the microphone that has picked up 
an uttered speech and the microphones that have picked up 
noise can be specified before speech recognition. This can 
shorten the processing time for determining speech data and 
noise data suitable for speech recognition. 

In the case where the structure in FIG. 1 is adapted to 
the case where speech switches smaller in number than the 
microphones M1-MN are provided, each speech switch may be 
assigned a layout range covering the associated microphone 
or microphones, and the one or more microphones belonging to 
each layout range may be specified in advance depending on 
which speech switch has been set on. With this structure, 
the data suitable for speech recognition have only to be 
extracted from the pre-specified single or plural speech 
frame data and noise frame data, thus making it possible to 
shorten the processing time. 
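The layout-range idea can be pictured as a fixed mapping from each speech switch to the microphones it covers. The switch names, the channel groupings and the function `channels_for_switch` below are purely hypothetical; the description does not specify any particular layout.

```python
from typing import Dict, List

# Hypothetical layout ranges: speech switch identifier -> microphone channels.
LAYOUT_RANGES: Dict[str, List[int]] = {
    "front": [1, 2],
    "rear": [3, 4],
}


def channels_for_switch(switch_id: str,
                        layout_ranges: Dict[str, List[int]] = LAYOUT_RANGES
                        ) -> List[int]:
    """Return the channels to analyze when the given speech switch is set on."""
    return list(layout_ranges.get(switch_id, []))
```

Only the channels returned for the activated switch would then need to be ranked, which is where the shorter processing time comes from.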

Although the foregoing description of this embodiment 
and modifications has been given of a speech recognition 
system adapted to on-board electronic equipment, the 
speech recognition system of this invention can also be 
adapted to other types of electronic apparatuses, such as a 
general-purpose microcomputer system and a so-called word 
processor, to enable voice-based entry of sentences or 
voice-based document editing. 

According to this invention, in short, when a speaker 
makes a desired speech, a speech signal and a noise signal 
suitable for speech recognition are automatically determined 
from the individual speech signals output from a plurality 
of voice pickup sections (or voice pickup means) and speech 
recognition is carried out based on the determined speech 
signal and noise signal. Accordingly, the speaker has only 
to utter words or voices without consciously making such a 
speech to a specific voice pickup section. This leads to an 
improved operability of the speech recognition system. 

Although only one embodiment of the present invention 
and some modifications thereof have been described herein, 
it should be apparent to those skilled in the art that the 
present invention may be embodied in many other specific 
forms without departing from the spirit or scope of the 
invention. Therefore, the present examples and embodiment 
are to be considered as illustrative and not restrictive and 
the invention is not to be limited to the details given 
herein, but may be modified within the scope of the appended 
claims. 

What is claimed is: 

1. A speech recognition system comprising: 

a plurality of voice pickup means for picking up 
uttered voices; 

determination means for determining a speech signal 
suitable for speech recognition from speech signals output 
from said plurality of voice pickup means; and 

speech recognition means for performing speech 
recognition based on said speech signal determined by said 
determination means. 

2. The speech recognition system according to claim 1, 
wherein that of said speech signals output from said 
plurality of voice pickup means whose speech level is equal 
to or higher than a predetermined speech level and continues 
over a predetermined period of time is determined as said 
speech signal suitable for speech recognition. 

3. The speech recognition system according to claim 1, 
wherein said determination means acquires an average S/N 
value and average voice power of each of said speech signals 
output from said plurality of voice pickup means and 
determines that of said speech signal whose average S/N 
value and average voice power are greater than respective 
predetermined threshold values as said speech signal 
suitable for speech recognition. 

4. The speech recognition system according to claim 3, 
wherein said determination means determines a candidate 
order of those speech signals whose average S/N values and 
average voice powers are greater than said respective 
predetermined threshold values and which are candidates for 
said speech signal suitable for speech recognition, in 
accordance with said average S/N values and average voice 
powers; and 

said speech recognition means sequentially executes 
speech recognition on said candidates in accordance with 
said candidate order from a highest candidate to a lower one. 

5. The speech recognition system according to any one 
of claims 1 to 4, wherein said determination means treats 
those of said speech signals which are other than said 
speech signal suitable for speech recognition as noise 
signals. 

6. The speech recognition system according to any one 
of claims 1 to 5, wherein of other speech signals than said 
speech signal suitable for speech recognition, that speech 
signal whose average S/N value and average voice power 
become minimum is treated as a noise signal by said 
determination means. 

7. A speech recognition system comprising: 

a plurality of voice pickup sections for picking up 
uttered voices; 

a determination section for determining a speech signal 
suitable for speech recognition from speech signals output 
from said plurality of voice pickup sections; and 

a speech recognizer for performing speech recognition 
based on said speech signal determined by said determination 
section. 

8. The speech recognition system according to claim 7, 
wherein that of said speech signals output from said 
plurality of voice pickup sections whose speech level is 
equal to or higher than a predetermined speech level and 
continues over a predetermined period of time is determined 
as said speech signal suitable for speech recognition. 

9. The speech recognition system according to claim 7, 
wherein said determination section acquires an average S/N 
value and average voice power of each of said speech signals 
output from said plurality of voice pickup sections and 
determines that of said speech signal whose average S/N 
value and average voice power are greater than respective 
predetermined threshold values as said speech signal 
suitable for speech recognition. 

10. The speech recognition system according to claim 9, 
wherein said determination section determines a candidate 
order of those speech signals whose average S/N values and 
average voice powers are greater than said respective 
predetermined threshold values and which are candidates for 
said speech signal suitable for speech recognition, in 
accordance with said average S/N values and average voice 
powers; and 

said speech recognizer sequentially executes speech 
recognition on said candidates in accordance with said 
candidate order from a highest candidate to a lower one. 

11. The speech recognition system according to any one 
of claims 7 to 10, wherein said determination section treats 
those of said speech signals which are other than said 
speech signal suitable for speech recognition as noise 
signals. 

12. The speech recognition system according to any one 
of claims 7 to 11, wherein of other speech signals than said 
speech signal suitable for speech recognition, that speech 
signal whose average S/N value and average voice power 
become minimum is treated as a noise signal by said 
determination section. 

13. A speech recognition method for a speech 
recognition system having a plurality of voice pickup means 
for picking up voices, comprising: 

a voice pickup step of picking up uttered voices using 
said plurality of voice pickup means; 

a determination step of determining a speech signal 
suitable for speech recognition from speech signals output 
from said plurality of voice pickup means; and 

a speech recognition step of performing speech 
recognition based on said speech signal determined by said 
determination step. 

14. The speech recognition method according to claim 
13, wherein that of said speech signals output from said 
plurality of voice pickup means whose speech level is equal 
to or higher than a predetermined speech level and continues 
over a predetermined period of time is determined as said 
speech signal suitable for speech recognition. 

15. The speech recognition method according to claim 
13, wherein said determination step includes a step of 
acquiring an average S/N value and average voice power of 
each of said speech signals output from said plurality of 
voice pickup means and determining that of said speech 
signal whose average S/N value and average voice power are 
greater than respective predetermined threshold values as 
said speech signal suitable for speech recognition. 

16. The speech recognition method according to claim 
15, wherein said determination step further includes a step 
of determining a candidate order of those speech signals 
whose average S/N values and average voice powers are 
greater than said respective predetermined threshold values 
and which are candidates for said speech signal suitable for 
speech recognition, in accordance with said average S/N 
values and average voice powers; and 

said speech recognition step sequentially executes 
speech recognition on said candidates in accordance with 
said candidate order from a highest candidate to a lower one. 

17. The speech recognition method according to any one 
of claims 13 to 16, wherein said determination step includes 
a step of treating those of said speech signals which are 
other than said speech signal suitable for speech 
recognition as noise signals. 

18. The speech recognition method according to any one 
of claims 13 to 17, wherein of other speech signals than 
said speech signal suitable for speech recognition, that 
speech signal whose average S/N value and average voice 
power become minimum is treated as a noise signal in said 
determination step. 

ABSTRACT OF THE DISCLOSURE 

Disclosed are a speech recognition system which 
comprises the following components, and a speech recognition 
method for this speech recognition system. The speech 
recognition system comprises a plurality of voice pickup 
sections for picking up uttered voices, a determination 
section for determining a speech signal suitable for speech 
recognition from speech signals output from the plurality of 
voice pickup sections, and a speech recognizer for 
performing speech recognition based on the speech signal 
determined by the determination section. 






[FIG. 2A: in-vehicle layout showing a passenger and microphones M2, M3 and M4] 

[FIG. 3A: microphones M4 and M6] 

[FIG. 3B: microphones M2, M4 and M6] 

[Figure, number not recoverable: memory areas divided into FIRST FRAME, SECOND FRAME and subsequent frame areas] 

[FIG. 6, panels including 6D and 6E: input data D2, D3 and so forth of channels CH2, CH3 and CH4 plotted against TIME, with the time ts and the intervals m1×TF and m2×TF marked] 

FIG. 7 

RANKING   AVERAGE S/N          AVERAGE VOICE POWER 
Rnk0      MAXIMUM VALUE        MAXIMUM VALUE 
Rnk1      THD2 OR GREATER      THD3 OR GREATER 
Rnk2      THD2 OR GREATER      SMALLER THAN THD3 
Rnk3      SMALLER THAN THD2    THD3 OR GREATER 
Rnk4      SMALLER THAN THD2    SMALLER THAN THD3 


FIG. 8 

          CH1     CH2     CH3     CH4     NOISE CHANNEL 
CASE 1    Rnk0    Rnk1    Rnk2    Rnk1    CH3 
CASE 2    Rnk0    Rnk1    Rnk1    Rnk2    CH4 
CASE 3    Rnk0    Rnk1    Rnk1    Rnk3    CH4 
CASE 4    Rnk0    Rnk1    Rnk1    Rnk4    CH4 
CASE 5    Rnk0    Rnk1    Rnk2    Rnk1    CH3 
CASE 6    Rnk0    Rnk1    Rnk2    Rnk2    CH3 
CASE 7    Rnk0    Rnk1    Rnk2    Rnk3    CH3 
CASE 8    Rnk0    Rnk1    Rnk2    Rnk4    CH3 
CASE 9    Rnk0    Rnk1    Rnk3    Rnk1    CH3 







FIG. 9 

[Flowchart] 
100: SPEECH SWITCH ON? (YES: continue) 
INITIALIZATION: n = 0, m = 0, F(1) to F(N) = 0 
106: SAMPLE DATA AND STORE INDIVIDUAL DATA D1-DN OF CHANNELS CH1-CHN IN m-TH FRAME AREAS OF MEMORY AREAS ME1-MEN 
ONE FRAME OF DATA D1-DN STORED? (NO: repeat) 
n = 1 
112: COMPUTE LPC RESIDUAL εn 
114: εn ≥ THD1? 
116 (YES): INCREMENT VALUE F(n) OF NUMBER-OF-SPEECHES COUNTER FCn CORRESPONDING TO n-TH CHANNEL 
118: CLEAR VALUE F(n) OF NUMBER-OF-SPEECHES COUNTER FCn CORRESPONDING TO n-TH CHANNEL 
n = N? (NO: n = n + 1; YES: continue) 
124: IS ANY OF F(1) TO F(N) EQUAL TO OR GREATER THAN "3"? (NO / YES) 
HAS VALUE OF NUMBER-OF-SPEECHES COUNTER WHICH IS EQUAL TO OR GREATER THAN "3" REACHED "5"? (NO / YES) 
TRANSFER SPEECH CANDIDATE DATA DCT2 AND NOISE CANDIDATE DATA DCT3 TO CONTROLLER 
SPEECH DETECTION COMPLETED 



FIG. 10 

[Flowchart] 
COMPUTE P(n), NP(n), AND S/N(n) 
202: DETERMINE SPEECH FRAME DATA OF CHANNEL SUITABLE FOR SPEECH RECOGNITION BASED ON SPEECH CONDITION TABLE IN FIG. 7 
204: DETERMINE NOISE FRAME DATA BASED ON NOISE SELECTION TABLE IN FIG. 8 
206: TRANSFER SPEECH CANDIDATE DATA DCT2 AND NOISE CANDIDATE DATA DCT3 TO CONTROLLER 
208: SPEECH RECOGNITION PROCESS 
210: ADEQUATE RECOGNITION RESULT OBTAINED? (YES: END; NO: continue) 
212: DESIGNATE NEXT CANDIDATE OF SPEECH FRAME DATA 



FIG. 11 

[Block diagram: MICROPHONE blocks connected to a MULTIPLEXER, whose output goes TO A/D CONVERTER; CONTROLLER] 



M, M & 0 Docket No. 



Nikaido, Marmelstein, Murray & Oram LLP 



Declaration For U.S. Patent Application 

As a below named inventor, I hereby declare that: 

My residence, post office address and citizenship are as stated below my name. 

I believe I am the original, first and sole inventor (if only one name is listed below) or an original, first and joint inventor (if plural names are listed below) of the subject matter which is claimed and for which a patent is sought on the invention entitled

"Speech Recognition System and Method"

the specification of which is attached hereto unless the following box is checked: 

□ was filed on ________ as United States Application Number or PCT International Application Number ________ and was amended on ________ (if applicable).



I hereby state that I have reviewed and understand the contents of the above-identified specification, including the claim(s), as amended by any amendment referred to above.

I acknowledge the duty to disclose information which is material to patentability as defined in 37 C.F.R. §1.56.

I hereby claim foreign priority benefits under 35 U.S.C. §119(a)-(d) or §365(b) of any foreign application(s) for patent or inventor's certificate, or §365(a) of any PCT International application which designated at least one country other than the United States, listed below and have also identified below any foreign application for patent or inventor's certificate or PCT International Application having a filing date before that of the application(s) for which priority is claimed:

11-2*6393 Japan 31/08/1999 

(Number) (Country) (Day/Month/Year Filed)




I hereby claim the benefit under 35 U.S.C. § 119(e) of any United States provisional application(s) listed below. 



(Application Number) (Filing Date) 

(See Note B on back of this page)

□ See attached list for additional prior foreign or provisional applications.

I hereby claim the benefit under 35 U.S.C. §120 of any United States application(s) or §365(c) of any PCT International application(s) designating the United States of America listed below and, insofar as the subject matter of each of the claims of this application is not disclosed in the prior application(s) (U.S. or PCT) in the manner provided by the first paragraph of 35 U.S.C. §112, I acknowledge the duty to disclose information which is material to patentability as defined in 37 C.F.R. §1.56 which became available between the filing date of the prior application and the national or PCT International filing date of this application.



(Applications or PCT International applications designating the U.S.) (Application Serial No.) (Filing Date) (Status) (patented, pending, abandoned)



And I hereby appoint as principal attorneys David T. Nikaido, Reg. No. 22,663; Charles M. Marmelstein, Reg. No. 25,895; George 
E. Oram, Jr., Reg. No. 27,931; Robert B. Murray, Reg. No. 22,980; Martin S. Postman, Reg. No. 18,570; E. Marcie Emas, Reg. 
No. 32,131; Douglas H. Goldhush, Reg. No. 33,125; Kevin C. Brown, Reg. No. 32,402; Monica Chin Kitts, Reg. No. 36,105; and 
Richard J. Berman, Reg. No. 39,107. 

Please direct all communications to the following address: NIKAIDO, MARMELSTEIN, MURRAY & ORAM LLP 

Metropolitan Square 

655 Fifteenth Street, N.W., Suite 330 - G Street Lobby 
Washington, D.C. 20005-5701 
(202) 638-5000 Fax: (202) 638-4810 

I hereby declare that all statements made herein of my own knowledge are true and that all statements made on information and belief 
are believed to be true; and further, that these statements were made with the knowledge that willful false statements and the like so 
made are punishable by fine or imprisonment, or both, under Section 1001 of Title 18 of the United States Code and that such willful 
false statements may jeopardize the validity of the application or any patent issued thereon. 

(See Note C on back of this page)

Full name of sole or first inventor: Shoutarou YODA
Inventor's signature: [signature]
Date: August 25,
Residence: Saitama-ken, Japan
Citizenship: Japan
Post Office Address: c/o Kawa[illegible], Pioneer Corporation, 25-1 Aza Nishimachi, Yamada, Kawagoe-shi, Saitama-ken 350-8555 Japan

Shoutarou YODA
Serial No.: New Application
For: SPEECH RECOGNITION SYSTEM AND METHOD



NOTIFICATION OF CHANGE OF NAME AND ADDRESS 
Commissioner for Patents
Washington, D. C. 20231

August 30, 2000

Sir: 

It is respectfully requested that the correspondence address for the above-identified
application be changed to the following:



ARENT FOX KINTNER PLOTKIN & KAHN, PLLC 
1050 Connecticut Avenue, N.W. 

Suite 600 
Washington, D. C. 20036-5339 
Tel: (202) 857-6000 
Fax: (202) 638-4810 



In the event that any fees are due with respect to this paper, please charge our 
Deposit Account No. 01-2300. 

Respectfully submitted, 
ARENT FOX KINTNER PLOTKIN & KAHN, PLLC

David T. Nikaido
1050 Connecticut Avenue, N.W.
Washington, D. C. 20036-5339